Informal Question: Is there a recipe for the perfect high school?


Proposal

The foundation of this project is to find evidence that supports the idea that graduation rate can be predicted.

Common Belief

Currently the most widely held belief among parents and teachers is that classroom size and individual teachers make the largest difference in whether a student graduates or not. Having taught in an inner city high school for a year, I have to agree with these beliefs. However, I haven’t seen any hard evidence that would let me say this with any confidence.

Outline

  1. Data Source & Description
  2. Cleaning the Data
  3. Exploratory Data Analysis
  4. Principal Component Analysis
  5. Tree Models
  6. Major Findings
  7. Conclusion

I will detail the data source I used, followed by what I did to clean and filter the data for my specific goal. I will then briefly discuss some general statistics surrounding the formatted data set, and conduct Principal Component Analysis to group the variables based on the largest variance. This will be followed by a more focused learning method, tree models, which will lead into my major findings and conclusions.


Reading in the Data Set

library(readr)
MA_Public_Schools_2017 <- read_csv("../Data/MA_Public_Schools_2017.csv")
View(MA_Public_Schools_2017)

The Data

I found the data set on Kaggle, titled Massachusetts Public Schools Data. It was compiled by Nigel Dalziel and is also available online through the State Report linked below. The only difference is that the State Report is split into numerous pieces, so it can be tedious to collate all of the data into one report.

Massachusetts Public School Online Report - http://profiles.doe.mass.edu/


Description

In terms of the actual data, the data set is correct as of August 2017 and contains a list of public schools in Massachusetts with their respective information. Specifically there are 1861 schools in the report with 302 variables detailing each school. Variables include enrollment by gender and race, class sizes, teacher salaries, graduation rate, test scores, and much more. Note that there is only one record per school, so in this data set I cannot compare the performance of schools over time.


Cleaning the Data

Step 1

First, I filtered the data to include only traditional four-year high schools by keeping records where the enrollment for each of grades 9, 10, 11 and 12 is greater than 0.

# Creating a duplicate of the original data set to then filter and modify
Public_Schools_Subset<-MA_Public_Schools_2017[,]
# Removing all schools that have no enrollment for 9th, 10th, 11th or 12th Grade
Public_High_Schools_Subset<-Public_Schools_Subset[which(Public_Schools_Subset$`9_Enrollment`>0 & Public_Schools_Subset$`10_Enrollment`>0 & Public_Schools_Subset$`11_Enrollment`>0 & Public_Schools_Subset$`12_Enrollment`>0),]

Step 2

Secondly, since I am looking at finding variables to determine graduation rate, I removed records with a missing graduation rate as opposed to imputing those values.

# Removing records that have a missing field for Graduation Rate
Public_High_Schools_Subset<-Public_High_Schools_Subset[!is.na(Public_High_Schools_Subset$`% Graduated`), ]

Step 3

With my intention being to create a decision tree, I then created a binary yes/no variable to indicate whether the graduation rate was at least 80%. I chose this round value because it sits close to the average graduation rate in the data set (the mean is about 84%). A school at or above 80% is therefore performing at roughly average or better compared to other schools in Massachusetts.

# Creating a binary variable for Graduation Rate greater than or equal to 80%
Public_High_Schools_Subset$Graduation_80_Percent<-ifelse(Public_High_Schools_Subset$`% Graduated`>=80, 'Yes', 'No')

Step 4

Lastly, I removed unnecessary variables such as test scores and college acceptance. The information was interesting, but it was not relevant to the scope of this project.

# Removing unnecessary variables
Public_High_Schools_Subset<-Public_High_Schools_Subset[,c(-1,-4,-5,-6,-7,-9,-11,-12,-15:-25,-32,-34,-36,-38,-40,-71:-302)]

Final Data Set

# Printing the dimensions of the final data set
dim(Public_High_Schools_Subset)
## [1] 365  47

So, in terms of the final data set, there are 365 high schools with 47 variables: school identifiers, the graduation outcomes, and the variables that could plausibly impact graduation rate.

# Changing string variables to factors to summarize
Public_High_Schools_Subset$`School Name`<-as.factor(Public_High_Schools_Subset$`School Name`)
Public_High_Schools_Subset$`School Type`<-as.factor(Public_High_Schools_Subset$`School Type`)
Public_High_Schools_Subset$Town<-as.factor(Public_High_Schools_Subset$Town)
Public_High_Schools_Subset$`District Name`<-as.factor(Public_High_Schools_Subset$`District Name`)
Public_High_Schools_Subset$Graduation_80_Percent<-as.factor(Public_High_Schools_Subset$Graduation_80_Percent)

# Summarizing the cleaned data set
summary(Public_High_Schools_Subset)
##                                            School Name 
##  A North Central Charter Essential (District)    :  1  
##  Abby Kelley Foster Charter Public School        :  1  
##  Abington High                                   :  1  
##  Academy Of the Pacific Rim Charter Public School:  1  
##  Acton-Boxborough Regional High                  :  1  
##  Advanced Math and Science Academy Charter School:  1  
##  (Other)                                         :359  
##          School Type           Town          Zip          Grade          
##  Charter School: 33   Dorchester :  9   Min.   :1001   Length:365        
##  Public School :332   Springfield:  9   1st Qu.:1520   Class :character  
##                       Boston     :  8   Median :1937   Mode  :character  
##                       Worcester  :  8   Mean   :1889                     
##                       Roxbury    :  6   3rd Qu.:2151                     
##                       Brockton   :  5   Max.   :2780                     
##                       (Other)    :320                                    
##      District Name  9_Enrollment    10_Enrollment    11_Enrollment
##  Boston     : 30   Min.   :   1.0   Min.   :   2.0   Min.   :  2  
##  Springfield:  9   1st Qu.:  89.0   1st Qu.:  85.0   1st Qu.: 79  
##  Worcester  :  7   Median : 171.0   Median : 162.0   Median :160  
##  Brockton   :  5   Mean   : 200.9   Mean   : 194.7   Mean   :190  
##  Lynn       :  4   3rd Qu.: 283.0   3rd Qu.: 290.0   3rd Qu.:286  
##  Chicopee   :  3   Max.   :1223.0   Max.   :1048.0   Max.   :978  
##  (Other)    :307                                                  
##  12_Enrollment   SP_Enrollment    TOTAL_Enrollment
##  Min.   :  3.0   Min.   : 0.000   Min.   :  13.0  
##  1st Qu.: 83.0   1st Qu.: 0.000   1st Qu.: 447.0  
##  Median :158.0   Median : 0.000   Median : 727.0  
##  Mean   :185.6   Mean   : 3.658   Mean   : 845.4  
##  3rd Qu.:268.0   3rd Qu.: 4.000   3rd Qu.:1188.0  
##  Max.   :993.0   Max.   :43.000   Max.   :4264.0  
##                                                   
##  % First Language Not English % English Language Learner
##  Min.   :  0.00               Min.   : 0.000            
##  1st Qu.:  2.40               1st Qu.: 0.500            
##  Median :  7.20               Median : 1.500            
##  Mean   : 16.81               Mean   : 6.158            
##  3rd Qu.: 24.40               3rd Qu.: 6.900            
##  Max.   :100.00               Max.   :79.600            
##                                                         
##  % Students With Disabilities  % High Needs   
##  Min.   :  0.00               Min.   : 11.70  
##  1st Qu.: 12.50               1st Qu.: 23.80  
##  Median : 15.80               Median : 37.50  
##  Mean   : 19.37               Mean   : 43.78  
##  3rd Qu.: 20.40               3rd Qu.: 58.90  
##  Max.   :100.00               Max.   :100.00  
##                                               
##  % Economically Disadvantaged % African American    % Asian      
##  Min.   : 3.10                Min.   : 0.00      Min.   : 0.000  
##  1st Qu.:13.10                1st Qu.: 1.50      1st Qu.: 1.000  
##  Median :24.80                Median : 3.50      Median : 2.200  
##  Mean   :30.44                Mean   :11.09      Mean   : 4.586  
##  3rd Qu.:43.90                3rd Qu.:11.90      3rd Qu.: 5.400  
##  Max.   :93.90                Max.   :80.50      Max.   :58.400  
##                                                                  
##    % Hispanic       % White      % Native American
##  Min.   : 0.20   Min.   : 0.60   Min.   :0.0000   
##  1st Qu.: 3.60   1st Qu.:41.70   1st Qu.:0.0000   
##  Median : 7.50   Median :76.30   Median :0.1000   
##  Mean   :17.51   Mean   :63.74   Mean   :0.2288   
##  3rd Qu.:26.00   3rd Qu.:88.70   3rd Qu.:0.3000   
##  Max.   :95.50   Max.   :97.90   Max.   :5.1000   
##                                                   
##  % Native Hawaiian, Pacific Islander % Multi-Race, Non-Hispanic
##  Min.   :0.00000                     Min.   : 0.000            
##  1st Qu.:0.00000                     1st Qu.: 1.600            
##  Median :0.00000                     Median : 2.400            
##  Mean   :0.09178                     Mean   : 2.754            
##  3rd Qu.:0.10000                     3rd Qu.: 3.700            
##  Max.   :1.20000                     Max.   :11.000            
##                                                                
##     % Males        % Females     Total # of Classes Average Class Size
##  Min.   :28.20   Min.   :11.40   Min.   :   3.0     Min.   : 3.80     
##  1st Qu.:48.60   1st Qu.:46.50   1st Qu.: 246.5     1st Qu.:13.50     
##  Median :50.50   Median :49.50   Median : 398.0     Median :15.60     
##  Mean   :51.83   Mean   :48.15   Mean   : 456.2     Mean   :15.46     
##  3rd Qu.:53.50   3rd Qu.:51.30   3rd Qu.: 607.5     3rd Qu.:17.40     
##  Max.   :88.60   Max.   :71.40   Max.   :2001.0     Max.   :34.00     
##                                  NA's   :2          NA's   :2         
##  Number of Students Salary Totals       Average Salary     FTE Count     
##  Min.   :  24.0     Min.   :  2198543   Min.   : 53763   Min.   :  33.0  
##  1st Qu.: 443.5     1st Qu.:  9369238   1st Qu.: 68502   1st Qu.: 128.2  
##  Median : 732.0     Median : 17502695   Median : 73599   Median : 238.0  
##  Mean   : 850.9     Mean   : 59588344   Mean   : 74398   Mean   : 731.8  
##  3rd Qu.:1197.5     3rd Qu.: 38245292   3rd Qu.: 79006   3rd Qu.: 512.0  
##  Max.   :4327.0     Max.   :383866184   Max.   :100731   Max.   :4323.0  
##  NA's   :2          NA's   :35          NA's   :35       NA's   :35      
##  In-District Expenditures Total In-district FTEs
##  Min.   :5.920e+06        Min.   :  425.4       
##  1st Qu.:2.345e+07        1st Qu.: 1628.0       
##  Median :4.310e+07        Median : 3110.9       
##  Mean   :1.627e+08        Mean   : 9658.6       
##  3rd Qu.:9.048e+07        3rd Qu.: 7017.1       
##  Max.   :1.093e+09        Max.   :56858.8       
##  NA's   :35               NA's   :35            
##  Average In-District Expenditures per Pupil Total Expenditures 
##  Min.   : 9452                              Min.   :6.864e+06  
##  1st Qu.:12881                              1st Qu.:2.579e+07  
##  Median :13881                              Median :4.741e+07  
##  Mean   :15038                              Mean   :1.862e+08  
##  3rd Qu.:17145                              3rd Qu.:1.030e+08  
##  Max.   :28227                              Max.   :1.270e+09  
##  NA's   :35                                 NA's   :35         
##  Total Pupil FTEs  Average Expenditures per Pupil  # in Cohort    
##  Min.   :  425.4   Min.   :10400                  Min.   :   6.0  
##  1st Qu.: 1760.4   1st Qu.:13457                  1st Qu.:  84.0  
##  Median : 3253.7   Median :14325                  Median : 165.0  
##  Mean   :10872.5   Mean   :15448                  Mean   : 195.1  
##  3rd Qu.: 7533.3   3rd Qu.:17524                  3rd Qu.: 273.0  
##  Max.   :65964.6   Max.   :28208                  Max.   :1013.0  
##  NA's   :35        NA's   :35                                     
##   % Graduated     % Still in School % Non-Grad Completers     % GED       
##  Min.   :  8.70   Min.   : 0.00     Min.   : 0.000        Min.   : 0.000  
##  1st Qu.: 83.30   1st Qu.: 1.40     1st Qu.: 0.000        1st Qu.: 0.000  
##  Median : 92.30   Median : 3.20     Median : 0.000        Median : 0.600  
##  Mean   : 84.07   Mean   : 6.69     Mean   : 1.179        Mean   : 1.106  
##  3rd Qu.: 96.20   3rd Qu.: 7.70     3rd Qu.: 1.000        3rd Qu.: 1.300  
##  Max.   :100.00   Max.   :66.70     Max.   :30.800        Max.   :19.000  
##                                                                           
##  % Dropped Out   % Permanently Excluded High School Graduates (#)
##  Min.   : 0.00   Min.   :0.00000        Min.   :  4.00           
##  1st Qu.: 1.00   1st Qu.:0.00000        1st Qu.: 78.25           
##  Median : 2.90   Median :0.00000        Median :159.00           
##  Mean   : 6.91   Mean   :0.04329        Mean   :178.44           
##  3rd Qu.: 7.50   3rd Qu.:0.00000        3rd Qu.:262.75           
##  Max.   :71.40   Max.   :1.90000        Max.   :897.00           
##                                         NA's   :7                
##  Graduation_80_Percent
##  No : 79              
##  Yes:286              
##                       
##                       
##                       
##                       
## 

Moving on to the exploratory data analysis, I was initially concerned that my cleaned data set was going to lack variation due to its size. However, based on the following findings, I felt comfortable that there was enough variety to find interesting conclusions in the data. For example within high schools:

  • The proportion of students defined as High Needs varies from 11.70% to 100%
  • The proportion of students defined as economically disadvantaged varies from 3.1% to 93.9%
  • The proportion of white students varies from 0.6% to 97.9%, and the proportion of males from 28.2% to 88.6%
  • Average class size varies from about 4 students to 34 students
  • Lastly the target variable, graduation rate varied from 8.7% to 100%.

These are the variables I instinctively thought may impact graduation rate, so having variation to study means the conclusions can be more generalized.
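
The ranges quoted above can be read off the summary output, but they can also be pulled out directly. The following is a minimal sketch, not part of the original analysis: `column_ranges` is a hypothetical helper name, and the built-in `mtcars` data set stands in for the school data so the example runs on its own.

```r
# Hypothetical helper: minimum and maximum of selected numeric columns,
# ignoring missing values. Returns one row per column.
column_ranges <- function(data, columns) {
  ranges <- t(sapply(data[, columns, drop = FALSE], range, na.rm = TRUE))
  colnames(ranges) <- c("min", "max")
  ranges
}

# Demonstration on a built-in data set (an assumption for illustration only)
column_ranges(mtcars, c("mpg", "hp"))

# Applied to this report's data it would look like:
# column_ranges(Public_High_Schools_Subset,
#               c("% High Needs", "% Economically Disadvantaged", "% White",
#                 "% Males", "Average Class Size", "% Graduated"))
```

Selecting the columns by name rather than by position keeps the check readable if the column order changes.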

How many schools are above 80% Graduation Rate?

# Calculating proportion of records with graduation at or above 80%
prop.table(table(Public_High_Schools_Subset$Graduation_80_Percent))
## 
##        No       Yes 
## 0.2164384 0.7835616
# Approximately 22% of my data set falls below 80% graduation rate, 78% are equal to or above it

Clustering Model

Purpose and Method

In order to get a better grasp on the types of schools within this data set, I used a cluster model and highlighted the distinguishing features of each group. I chose principal component analysis to form my groups since this is the method I am most comfortable with. I could also have chosen k-means, since my final variables are all numeric. As I said, I wanted to see the types of schools, but I also wanted to see if there were any interesting relationships between variables. For this model, I based the number of groups both on the scree plot and on the cumulative percentage of variance accounted for being greater than 80%, so that little of the variance is lost.

Preparation for Principal Component Analysis

# Checking structure of Subset
str(Public_High_Schools_Subset)
## Classes 'tbl_df', 'tbl' and 'data.frame':    365 obs. of  47 variables:
##  $ School Name                               : Factor w/ 365 levels "A North Central Charter Essential (District)",..: 3 7 9 10 12 15 16 19 20 21 ...
##  $ School Type                               : Factor w/ 2 levels "Charter School",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Town                                      : Factor w/ 245 levels "Abington","Acton",..: 1 4 5 5 7 8 10 12 13 14 ...
##  $ Zip                                       : int  2351 1001 1913 1913 1810 2476 1721 2703 1501 2322 ...
##  $ Grade                                     : chr  "09,10,11,12" "09,10,11,12" "09,10,11,12" "09,10,11,12" ...
##  $ District Name                             : Factor w/ 287 levels "Abby Kelley Foster Charter Public (District)",..: 2 7 8 8 10 11 13 16 17 18 ...
##  $ 9_Enrollment                              : int  124 299 147 7 446 332 202 449 174 53 ...
##  $ 10_Enrollment                             : int  109 309 138 2 459 350 172 401 197 49 ...
##  $ 11_Enrollment                             : int  123 293 145 11 421 312 183 419 175 48 ...
##  $ 12_Enrollment                             : int  92 315 163 11 462 295 187 392 157 45 ...
##  $ SP_Enrollment                             : int  4 6 1 0 18 1 0 10 4 4 ...
##  $ TOTAL_Enrollment                          : int  452 1222 594 31 1806 1290 744 1671 795 319 ...
##  $ % First Language Not English              : num  5.3 4.6 2.9 0 9.5 12.4 11.3 10.2 5.7 7.2 ...
##  $ % English Language Learner                : num  2.4 1.3 0.5 0 0.8 0.9 1.7 2.3 2.1 2.2 ...
##  $ % Students With Disabilities              : num  9.7 14.1 17 51.6 16.1 11.1 13.2 14.2 9.6 16.3 ...
##  $ % High Needs                              : num  28.8 32 25.9 83.9 20.9 20.7 22.8 35.5 23.8 34.8 ...
##  $ % Economically Disadvantaged              : num  21.5 22.7 14.6 74.2 6.3 10.3 10.3 25.6 15.2 23.8 ...
##  $ % African American                        : num  2.2 1.2 1.3 0 1.9 4.1 2 5.5 2.1 37.3 ...
##  $ % Asian                                   : num  1.5 2.2 1.2 0 14.5 10.7 9 4.1 4.3 5.6 ...
##  $ % Hispanic                                : num  9.1 5.8 4.2 6.5 5 5.7 9.4 12.5 6.4 4.7 ...
##  $ % White                                   : num  85.8 88.8 90.7 87.1 76.3 75.6 77.2 73.6 84 48.3 ...
##  $ % Native American                         : num  0.2 0 0 0 0.1 0 0.3 0.3 0.4 0.6 ...
##  $ % Native Hawaiian, Pacific Islander       : num  0.2 0.1 0 0 0 0.2 0 0.2 0 0 ...
##  $ % Multi-Race, Non-Hispanic                : num  0.9 1.9 2.5 6.5 2.2 3.7 2.2 3.8 2.8 3.4 ...
##  $ % Males                                   : num  45.6 52 53.5 64.5 48.9 49.1 48.8 54.2 48.1 48.3 ...
##  $ % Females                                 : num  54.4 48 46.5 35.5 51.1 50.9 51.2 45.8 51.9 51.4 ...
##  $ Total # of Classes                        : num  204 590 380 42 1160 682 431 930 561 218 ...
##  $ Average Class Size                        : num  15.8 16.8 16.7 7.6 14.7 14.3 14.6 18.1 17.3 13.3 ...
##  $ Number of Students                        : num  451 1242 621 33 1799 ...
##  $ Salary Totals                             : num  9489496 20849537 12158388 12158388 37560123 ...
##  $ Average Salary                            : num  74662 64769 77147 77147 81211 ...
##  $ FTE Count                                 : num  127 322 158 158 463 360 184 397 173 65 ...
##  $ In-District Expenditures                  : num  23365711 53871628 29541450 29541450 90103626 ...
##  $ Total In-district FTEs                    : num  1939 3977 2272 2272 6128 ...
##  $ Average In-District Expenditures per Pupil: num  12050 13546 13001 13001 14703 ...
##  $ Total Expenditures                        : num  27229101 59044279 33668081 33668081 97309356 ...
##  $ Total Pupil FTEs                          : num  2052 4111 2445 2445 6237 ...
##  $ Average Expenditures per Pupil            : num  13271 14363 13772 13772 15602 ...
##  $ # in Cohort                               : num  114 325 163 9 441 318 184 411 160 48 ...
##  $ % Graduated                               : num  94.7 94.2 93.9 66.7 95.7 98.1 95.1 91 96.9 83.3 ...
##  $ % Still in School                         : num  0.9 1.8 4.3 22.2 3.4 0.3 0.5 3.4 1.3 8.3 ...
##  $ % Non-Grad Completers                     : num  0 0 0 0 0 0.3 1.6 1 0.6 2.1 ...
##  $ % GED                                     : num  0.9 0.3 0.6 0 0.5 0 1.6 0.5 0.6 0 ...
##  $ % Dropped Out                             : num  3.5 3.7 1.2 11.1 0.5 1.3 1.1 4.1 0.6 6.3 ...
##  $ % Permanently Excluded                    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ High School Graduates (#)                 : num  124 288 124 NA 413 306 162 358 158 41 ...
##  $ Graduation_80_Percent                     : Factor w/ 2 levels "No","Yes": 2 2 2 1 2 2 2 2 2 2 ...
# Removing identifier and categorical columns (1 to 6) and the binary target (column 47)
Public_High_Schools_Subset_PCA<-Public_High_Schools_Subset[,c(-1:-6,-47)]

# Removing Missing Records
Public_High_Schools_Subset_PCA<-na.omit(Public_High_Schools_Subset_PCA)

# Preventing numbers from being displayed in scientific notation
options(scipen = 999)
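
Because `na.omit` drops every row with at least one missing value, it is worth checking how many schools are lost at this step. The summary above shows 35 missing values in the salary and expenditure columns, so at least 35 rows will be removed. The sketch below is an illustration only: `rows_lost_to_na` is a hypothetical helper name, and the built-in `airquality` data set is used so the example is self-contained.

```r
# Hypothetical helper: number of rows that na.omit() would drop
rows_lost_to_na <- function(data) {
  nrow(data) - sum(complete.cases(data))
}

# Demonstration on a built-in data set that contains missing values
rows_lost_to_na(airquality)

# Missing values per column, to see which variables drive the loss
colSums(is.na(airquality))

# Applied to this report's data, before calling na.omit(), it would be:
# rows_lost_to_na(Public_High_Schools_Subset[, c(-1:-6, -47)])
```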

Conducting Principal Component Analysis

# Conducting PCA
principal_component_analysis_normalized<-prcomp(Public_High_Schools_Subset_PCA, scale. = TRUE)

# Summary of PCA
summary(principal_component_analysis_normalized)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6
## Standard deviation     3.5784 3.0320 1.99754 1.53509 1.09849 1.08533
## Proportion of Variance 0.3201 0.2298 0.09975 0.05891 0.03017 0.02945
## Cumulative Proportion  0.3201 0.5500 0.64970 0.70861 0.73878 0.76823
##                            PC7    PC8     PC9    PC10    PC11    PC12
## Standard deviation     1.05507 0.9818 0.97736 0.95598 0.87275 0.84448
## Proportion of Variance 0.02783 0.0241 0.02388 0.02285 0.01904 0.01783
## Cumulative Proportion  0.79606 0.8202 0.84404 0.86689 0.88593 0.90376
##                           PC13    PC14   PC15    PC16    PC17    PC18
## Standard deviation     0.78612 0.76769 0.7127 0.66244 0.61396 0.57112
## Proportion of Variance 0.01545 0.01473 0.0127 0.01097 0.00942 0.00815
## Cumulative Proportion  0.91921 0.93394 0.9466 0.95761 0.96703 0.97519
##                           PC19    PC20    PC21    PC22    PC23    PC24
## Standard deviation     0.50907 0.47826 0.39221 0.35429 0.23741 0.21575
## Proportion of Variance 0.00648 0.00572 0.00385 0.00314 0.00141 0.00116
## Cumulative Proportion  0.98167 0.98738 0.99123 0.99437 0.99578 0.99694
##                           PC25    PC26    PC27    PC28    PC29    PC30
## Standard deviation     0.16957 0.14476 0.13147 0.12521 0.10815 0.10095
## Proportion of Variance 0.00072 0.00052 0.00043 0.00039 0.00029 0.00025
## Cumulative Proportion  0.99766 0.99818 0.99862 0.99901 0.99930 0.99955
##                           PC31    PC32    PC33    PC34    PC35    PC36
## Standard deviation     0.09839 0.07002 0.04766 0.01947 0.01723 0.01414
## Proportion of Variance 0.00024 0.00012 0.00006 0.00001 0.00001 0.00000
## Cumulative Proportion  0.99980 0.99992 0.99998 0.99999 0.99999 1.00000
##                            PC37     PC38    PC39     PC40
## Standard deviation     0.009256 0.002361 0.00184 0.001806
## Proportion of Variance 0.000000 0.000000 0.00000 0.000000
## Cumulative Proportion  1.000000 1.000000 1.00000 1.000000
# Result: There are 40 principal components because there are 40 variables
# Result: Principal Component 1 accounts for 32.01% of the variation
# Result: Principal Component 2 accounts for 22.98% of the variation
# Result: Principal Component 3 accounts for 9.98% of the variation
# Result: Principal Component 4 accounts for 5.89% of the variation
# Result: Principal Components 1 through 4 cumulatively account for 70.86% of the variation
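
The `scale.` argument matters here because the variables sit on very different scales: percentages run from 0 to 100 while expenditure totals run into the hundreds of millions. Without scaling, the largest-variance columns would dominate the first component. The sketch below illustrates the effect on the built-in `USArrests` data set, which is an assumption for demonstration and not the school data; `pc1_share` is a hypothetical helper name.

```r
# Built-in data set whose columns are on different scales
unscaled_pca <- prcomp(USArrests, scale. = FALSE)
scaled_pca   <- prcomp(USArrests, scale. = TRUE)

# Hypothetical helper: share of total variance captured by the first component
pc1_share <- function(pca) {
  pca$sdev[1]^2 / sum(pca$sdev^2)
}

# Without scaling, PC1 absorbs far more of the variance than with scaling
pc1_share(unscaled_pca)
pc1_share(scaled_pca)

# Variable with the largest absolute loading on the unscaled PC1
names(which.max(abs(unscaled_pca$rotation[, 1])))
```

In the unscaled fit the first component is driven almost entirely by the column with the largest raw variance, which is the behavior that scaling is meant to avoid.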

Visually Displaying Variance

# Bar Chart showing Principal Components accounting for Variance
principal_components_variance<-(principal_component_analysis_normalized$sdev^2 / sum(principal_component_analysis_normalized$sdev^2))*100
barplot(principal_components_variance, las=2, xlab="Principal Component", ylab="% Variance Explained", main="Principal Components versus Percent of Variance Explained")

# Scree Plot to determine Optimal Principal Components
screeplot(principal_component_analysis_normalized, type="line")

# Result: Elbow located at 4 principal components

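Based on the scree plot and the cumulative variance accounted for, I have chosen to work with the first 8 principal components. The elbow of the scree plot sits at 4 components, but those account for only 70.86% of the variance, whereas the first 8 account for 82.02% of the variance within the data set.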
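
The 80% rule can also be applied programmatically rather than by reading the summary table. The following is a minimal sketch under stated assumptions: `components_for_variance` is a hypothetical helper name, and the built-in `USArrests` data set is used so the example runs without the school data.

```r
# Hypothetical helper: smallest number of principal components whose
# cumulative proportion of variance reaches the given threshold
components_for_variance <- function(pca, threshold = 0.80) {
  variance   <- pca$sdev^2
  cumulative <- cumsum(variance) / sum(variance)
  which(cumulative >= threshold)[1]
}

# Demonstration on a built-in data set
demo_pca <- prcomp(USArrests, scale. = TRUE)
components_for_variance(demo_pca, 0.80)

# Applied to this report's model it would be:
# components_for_variance(principal_component_analysis_normalized, 0.80)
```

For the model in this report, the cumulative proportions in the summary (79.61% at PC7 and 82.02% at PC8) mean such a helper would return 8 for a threshold of 0.80.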

Observing Loading Weights for PCA

# Observing Loading Weights for PCA
options(max.print=1000000)
principal_component_analysis_normalized$rotation[,1:8]
##                                                     PC1          PC2
## 9_Enrollment                                0.145457125 -0.258260450
## 10_Enrollment                               0.156787762 -0.254891889
## 11_Enrollment                               0.158660019 -0.254630711
## 12_Enrollment                               0.163197565 -0.250045953
## SP_Enrollment                              -0.016351699 -0.148684296
## TOTAL_Enrollment                            0.156040340 -0.258881104
## % First Language Not English               -0.166587294 -0.192898883
## % English Language Learner                 -0.175272010 -0.156455968
## % Students With Disabilities               -0.146984180  0.061361539
## % High Needs                               -0.237627082 -0.068411853
## % Economically Disadvantaged               -0.233950400 -0.056280959
## % African American                         -0.189101578 -0.132429667
## % Asian                                     0.031318531 -0.155415693
## % Hispanic                                 -0.185382272 -0.115288429
## % White                                     0.211398443  0.178753521
## % Native American                          -0.016591943  0.006038614
## % Native Hawaiian, Pacific Islander         0.001025217 -0.021122762
## % Multi-Race, Non-Hispanic                  0.002916159  0.019206205
## % Males                                    -0.148425881  0.046705660
## % Females                                   0.148404439 -0.047415795
## Total # of Classes                          0.155098082 -0.219590238
## Average Class Size                          0.110025173 -0.155083364
## Number of Students                          0.154731179 -0.259762511
## Salary Totals                              -0.203031688 -0.174093839
## Average Salary                             -0.060634344 -0.165184324
## FTE Count                                  -0.206459707 -0.176636712
## In-District Expenditures                   -0.203282676 -0.170736700
## Total In-district FTEs                     -0.206342657 -0.177507963
## Average In-District Expenditures per Pupil -0.090220157 -0.070287467
## Total Expenditures                         -0.203159680 -0.169757799
## Total Pupil FTEs                           -0.206785141 -0.175143482
## Average Expenditures per Pupil             -0.077051036 -0.078728674
## # in Cohort                                 0.151139346 -0.255019066
## % Graduated                                 0.218395443 -0.019225401
## % Still in School                          -0.200951747 -0.014470668
## % Non-Grad Completers                      -0.086900914  0.022951067
## % GED                                      -0.117271472  0.036118545
## % Dropped Out                              -0.186513113  0.031935317
## % Permanently Excluded                     -0.011406523  0.029100904
## High School Graduates (#)                   0.172179298 -0.239126017
##                                                     PC3          PC4
## 9_Enrollment                               -0.140289592 -0.008883509
## 10_Enrollment                              -0.127193286 -0.022867515
## 11_Enrollment                              -0.119242271 -0.014993824
## 12_Enrollment                              -0.113051183 -0.025790855
## SP_Enrollment                              -0.126235679  0.104269221
## TOTAL_Enrollment                           -0.109242402 -0.011687264
## % First Language Not English               -0.037705980  0.103278330
## % English Language Learner                 -0.021671192  0.117317797
## % Students With Disabilities               -0.222739036 -0.188973083
## % High Needs                               -0.178218056 -0.022174795
## % Economically Disadvantaged               -0.177772708  0.031767652
## % African American                          0.072857023 -0.023400034
## % Asian                                     0.059622848 -0.025769174
## % Hispanic                                 -0.157093753  0.126749133
## % White                                     0.060907739 -0.050624090
## % Native American                           0.005232498 -0.258905206
## % Native Hawaiian, Pacific Islander         0.055507634  0.067567626
## % Multi-Race, Non-Hispanic                 -0.088258716 -0.268834879
## % Males                                    -0.286212749 -0.204022796
## % Females                                   0.286976098  0.204145358
## Total # of Classes                         -0.132236584 -0.097939423
## Average Class Size                          0.202355609  0.210311991
## Number of Students                         -0.106380590 -0.008663606
## Salary Totals                               0.191333309  0.023383681
## Average Salary                              0.156278997 -0.243210316
## FTE Count                                   0.173302157  0.041650235
## In-District Expenditures                    0.195966956  0.018582945
## Total In-district FTEs                      0.170568390  0.048120835
## Average In-District Expenditures per Pupil  0.116447849 -0.502198200
## Total Expenditures                          0.198006005  0.018919096
## Total Pupil FTEs                            0.174707329  0.046445899
## Average Expenditures per Pupil              0.123610197 -0.514394347
## # in Cohort                                -0.138299634  0.004145160
## % Graduated                                 0.242788249 -0.091564005
## % Still in School                          -0.099006593  0.076846770
## % Non-Grad Completers                      -0.263035419  0.048176497
## % GED                                      -0.157736148  0.131759924
## % Dropped Out                              -0.244767374  0.067145012
## % Permanently Excluded                     -0.044928905  0.047878900
## High School Graduates (#)                  -0.105953143 -0.027334619
##                                                     PC5          PC6
## 9_Enrollment                                0.002172601 -0.012607948
## 10_Enrollment                               0.001265358  0.005533991
## 11_Enrollment                              -0.011627060  0.020880298
## 12_Enrollment                              -0.015033238  0.040044454
## SP_Enrollment                              -0.045312611  0.092413747
## TOTAL_Enrollment                           -0.020019203  0.012545436
## % First Language Not English                0.190349490 -0.218686212
## % English Language Learner                  0.124364630 -0.175744825
## % Students With Disabilities               -0.053818545  0.265618126
## % High Needs                                0.058171560 -0.113703514
## % Economically Disadvantaged                0.009660945 -0.132515575
## % African American                         -0.167473616  0.036060831
## % Asian                                    -0.069124813  0.057124175
## % Hispanic                                  0.289395364 -0.212368959
## % White                                    -0.072875735  0.139608325
## % Native American                          -0.263312510 -0.470610544
## % Native Hawaiian, Pacific Islander        -0.491562729 -0.261007866
## % Multi-Race, Non-Hispanic                 -0.338215449 -0.355299457
## % Males                                    -0.114757335  0.290926088
## % Females                                   0.115545597 -0.292727110
## Total # of Classes                          0.026712974 -0.019981263
## Average Class Size                          0.063213773 -0.088322333
## Number of Students                         -0.025807566  0.015755330
## Salary Totals                              -0.092116261  0.103564243
## Average Salary                             -0.017152703  0.210158387
## FTE Count                                  -0.083425097  0.092485077
## In-District Expenditures                   -0.090451871  0.102492437
## Total In-district FTEs                     -0.092379017  0.094869989
## Average In-District Expenditures per Pupil  0.274846537 -0.120465987
## Total Expenditures                         -0.089334535  0.101702606
## Total Pupil FTEs                           -0.089095391  0.093186008
## Average Expenditures per Pupil              0.261095289 -0.106883084
## # in Cohort                                -0.021102398  0.019429254
## % Graduated                                 0.018016480  0.037829618
## % Still in School                           0.024080569 -0.035958540
## % Non-Grad Completers                      -0.125223180 -0.017780932
## % GED                                       0.034478883 -0.102625741
## % Dropped Out                              -0.021826987 -0.018002573
## % Permanently Excluded                      0.385033284 -0.014350121
## High School Graduates (#)                  -0.030579696  0.056220360
##                                                     PC7           PC8
## 9_Enrollment                                0.002958142 -0.0183926372
## 10_Enrollment                              -0.007179624 -0.0145879558
## 11_Enrollment                              -0.009447699 -0.0006552868
## 12_Enrollment                              -0.005460741 -0.0229093166
## SP_Enrollment                               0.228237722 -0.4882157261
## TOTAL_Enrollment                           -0.009527584 -0.0189276245
## % First Language Not English                0.024295752  0.2074169083
## % English Language Learner                  0.146870753  0.0941314910
## % Students With Disabilities                0.205796984 -0.0113419432
## % High Needs                                0.115794193  0.0183826776
## % Economically Disadvantaged                0.060304475 -0.0244386902
## % African American                         -0.018720174 -0.0634096985
## % Asian                                    -0.340767500  0.4175019752
## % Hispanic                                  0.118043115  0.0893506187
## % White                                     0.029926558 -0.1176289812
## % Native American                           0.188853997 -0.2162210677
## % Native Hawaiian, Pacific Islander         0.033331051  0.0635026186
## % Multi-Race, Non-Hispanic                 -0.431031123 -0.1362826467
## % Males                                    -0.063782864  0.0667108502
## % Females                                   0.062684827 -0.0678999631
## Total # of Classes                          0.043622271 -0.0658517127
## Average Class Size                         -0.065630991  0.1408039081
## Number of Students                         -0.012458246 -0.0235029661
## Salary Totals                              -0.025513348 -0.0710246674
## Average Salary                             -0.153235612  0.1039608799
## FTE Count                                  -0.015018015 -0.0713873986
## In-District Expenditures                   -0.024297820 -0.0761239659
## Total In-district FTEs                     -0.016762741 -0.0684489247
## Average In-District Expenditures per Pupil  0.063525662  0.0295306468
## Total Expenditures                         -0.024155231 -0.0760013447
## Total Pupil FTEs                           -0.014810823 -0.0706979375
## Average Expenditures per Pupil              0.033175305  0.0664399673
## # in Cohort                                -0.010784017 -0.0181016799
## % Graduated                                 0.125540621  0.0385032998
## % Still in School                           0.023022011 -0.1153096763
## % Non-Grad Completers                       0.200958114  0.2600894414
## % GED                                      -0.399615334  0.1489119945
## % Dropped Out                              -0.237503102 -0.0777374935
## % Permanently Excluded                     -0.439295359 -0.4954971728
## High School Graduates (#)                  -0.028440172 -0.0023483273
# Result: PC1 = Positive (% White, % Graduated)
# Result: PC1 = Negative (% High Needs, % Economically Disadvantaged)
# Interpretation = High Performing Public Schools in Affluent Areas

# Result: PC2 = Positive (% White)
# Result: PC2 = Negative (Total Enrollment, Number of Students)
# Interpretation = Public Schools in Affluent Areas

# Result: PC3 = Positive (% Females, Average Class Size, % Graduated)
# Result: PC3 = Negative (% Students with Disabilities, % Dropped Out, % Non-Grad Completers)
# Interpretation = High Performing Public Schools in Affluent Neighborhood

# Result: PC4 = Positive (% Females, Average Class Size)
# Result: PC4 = Negative (% Native American, % Multi-race, Average Expenditures per Pupil)
# Interpretation = Public Schools in Affluent Neighborhood

# Result: PC5 = Positive (% Hispanic, % Permanently Excluded)
# Result: PC5 = Negative (% Multi-race)
# Interpretation = Inner City Public Schools

# Result: PC6 = Positive (% Students with Disabilities, % Males, Average Salary)
# Result: PC6 = Negative (% Hispanic, % First Language Not English)
# Interpretation = Special Education Schools

# Result: PC7 = Positive(% Students with Disabilities, % Non-grad Completers)
# Result: PC7 = Negative(% Permanently Excluded, % GED, % Multi-race)
# Interpretation = Challenging Special Education Schools

# Result: PC8 = Positive(% Asian, % First Language Not English, % Non-grad Completers)
# Result: PC8 = Negative(% Permanently Excluded, Native American)
# Interpretation = Schools with High Rate of Foreign Exchange Students

As cluttered as this appears, it had significant value for how I proceeded. Simply put, for each component I listed the variables with the strongest positive loadings and those with the strongest negative loadings, and then gave my interpretation of what such a school might look like. For example, in the first group the schools have a high percentage of white students and a high graduation rate, along with a low percentage of high needs and economically disadvantaged students. My initial assumption was that these are private schools, but this data set only contains public schools. This made me reflect hard about what actually defines a school, because I was labeling groups based on my experience and prejudice as opposed to hard facts. So I changed my biased interpretation to high performing public schools in affluent areas. I did the same for the other groups; however, I noticed a few things.
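The listing of strong positive and negative variables per component can also be done programmatically. The following is a minimal sketch, not the code used for this report: it runs `prcomp` on the built-in `mtcars` data as a stand-in for the school data, and `top_loadings` is a hypothetical helper name. Keep in mind that the sign of a principal component is arbitrary, so "positive" and "negative" only describe which variables sit at opposite ends.

```r
# Stand-in data: mtcars is NOT the school data set, it only makes the sketch runnable
pca <- prcomp(mtcars, center = TRUE, scale. = TRUE)

# Hypothetical helper: strongest positive and negative loadings for one component
top_loadings <- function(pca, component, n = 3) {
  loadings <- pca$rotation[, component]       # named vector, one entry per variable
  ordered <- sort(loadings, decreasing = TRUE)
  list(
    positive = head(ordered, n),              # largest loadings first
    negative = rev(tail(ordered, n))          # most negative loading first
  )
}

top_loadings(pca, "PC1")
```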

Key Takeaway 1

In the first group, a high percentage of white students is grouped with a high graduation rate.

Key Takeaway 2

In the third group a high female population and average class size is also linked to a high graduation rate.

Key Takeaway 3

In the fifth group a high Hispanic population is grouped with a high percentage of students being permanently excluded.

These are valuable insights moving forward because I’m expecting graduation rate to be linked to gender and race, which sounds terrible: how can a school control these? Shouldn’t I be excluding them? After thinking long and hard about this I decided that these variables do define schools. As unfair as it may seem, the reality is that many schools are defined by race and gender. Being aware of this changes how I approach decision trees in the next section.

Key Takeaway 4

Lastly, this simple clustering exercise also highlighted the fact that I have correlated variables. Of course if the total enrollment is low, the number of students will also be low.
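A correlation matrix is one quick way to confirm which variables overlap like this. The sketch below uses simulated values with made-up column names (`total_enrollment`, `number_of_students`, `average_class_size`), and the 0.9 cutoff is an arbitrary choice for illustration.

```r
# Simulated stand-in data; the column names and values are hypothetical
set.seed(1)
n <- 100
total_enrollment <- round(runif(n, 100, 2000))
toy <- data.frame(
  total_enrollment = total_enrollment,
  number_of_students = total_enrollment + round(rnorm(n, 0, 10)),
  average_class_size = runif(n, 10, 30)
)

cor_matrix <- cor(toy)
cor_matrix[upper.tri(cor_matrix, diag = TRUE)] <- NA   # keep each pair only once

# Pairs whose absolute correlation exceeds the (arbitrary) 0.9 cutoff
high <- which(abs(cor_matrix) > 0.9, arr.ind = TRUE)
correlated_pairs <- data.frame(
  var1 = rownames(cor_matrix)[high[, "row"]],
  var2 = colnames(cor_matrix)[high[, "col"]],
  correlation = cor_matrix[high]
)
correlated_pairs
```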


Tree Models

Purpose

I chose to model this data using decision trees because I believe they are the most visually intuitive modeling method for individuals outside of the data analytics world. Furthermore, I’m looking for groups of schools that are succeeding or failing. Tree models are structured so that the first node represents the variable with the greatest effect on the target variable, which in essence allows specific variables to be targeted for improvement.
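To make "greatest effect" concrete: with a Gini split criterion, a classification tree favors the split that reduces Gini impurity the most. The sketch below is a simplified illustration of that idea, not rpart's internal code. The eight toy schools are invented, and `gini` and `split_gain` are hypothetical helper names.

```r
# Gini impurity of a set of class labels: 1 - sum of squared class proportions
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Impurity reduction from splitting a numeric variable x at a threshold
split_gain <- function(x, labels, threshold) {
  left <- labels[x < threshold]
  right <- labels[x >= threshold]
  n <- length(labels)
  gini(labels) -
    (length(left) / n) * gini(left) -
    (length(right) / n) * gini(right)
}

# Invented toy data: % high needs and whether graduation rate is at least 80%
high_needs <- c(20, 30, 40, 55, 70, 80, 85, 90)
graduated_80 <- c("yes", "yes", "yes", "yes", "no", "no", "no", "no")

split_gain(high_needs, graduated_80, threshold = 68)   # 0.5: a perfect split here
split_gain(high_needs, graduated_80, threshold = 35)   # smaller gain
```

The variable and threshold with the largest gain would be placed at the top of the tree, which is why the first node is read as the most influential variable.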

Varying Tree Models

Given the complexity of how schools can be defined, instead of making one single decision tree I decided to break the variables up into categories for further analysis. Furthermore, the following trees use every record in the data set. In other words, there are no test and training sets, because the set is small to begin with and I wanted to see as many variations as possible.
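Even without a held-out test set, rpart reports a cross-validated error for each tree size, which gives a rough check on overfitting. The sketch below is an assumption about how that could be inspected, not something done in this project. It uses the `kyphosis` data that ships with rpart as a stand-in for the school data.

```r
library(rpart)

# kyphosis ships with rpart and stands in for the school data here
set.seed(123)
cv_fit <- rpart(Kyphosis ~ ., data = kyphosis, method = "class",
                parms = list(split = "gini"),
                control = rpart.control(xval = 10))   # 10-fold cross-validation

# The "xerror" column is the cross-validated error relative to the root node
printcp(cv_fit)
```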

Preparing Data for Tree Model

# Removing all identifying variables (i.e. name, location etc.)
Total_Public_High_Schools_Subset<-Public_High_Schools_Subset[,c(-1:-6,-40:-46)]

# Loading packages needed to plot tree diagrams
library(rpart)
library(rpart.plot)
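Dropping columns by position works, but it silently breaks if the column order ever changes. As a hedged alternative, the sketch below drops identifying columns by name. The data frame and every column name in it are hypothetical and only illustrate the pattern.

```r
# Hypothetical toy data; none of these names are taken from the real data set
toy_schools <- data.frame(
  School_Name = c("A", "B", "C"),
  Town = c("X", "Y", "Z"),
  Pct_High_Needs = c(25, 60, 80),
  Average_Class_Size = c(18, 22, 27),
  Graduation_80_Percent = c(1, 1, 0)
)

# Name the identifying columns once, then keep everything else
identifying_columns <- c("School_Name", "Town")
model_columns <- setdiff(names(toy_schools), identifying_columns)
toy_model_data <- toy_schools[, model_columns, drop = FALSE]

names(toy_model_data)
```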

First Tree Model - All Variables

# Creating Tree with all non-identifying unique variables
set.seed(123)
graduation_total_rpart <- rpart(Graduation_80_Percent ~ ., data = Total_Public_High_Schools_Subset, method = "class", parms = list(split = "gini"))
rpart.plot(graduation_total_rpart, type=0, extra=101)

This first model contains all of the variables, and it shows that the most dominant variable determining graduation rate is the percentage of students with high needs, followed by economically disadvantaged, expenditures and so forth. This was my initial tree and I had a hard time justifying changes based on this model. For example, if I were presenting this to a new high school principal, their first argument would be that my model shows high graduation rates only for schools where fewer than 68% of students are high needs. Furthermore, the lower nodes filter by the types of students, which are not actionable steps. A public school can’t necessarily control its gender balance, special enrollment or number of students with disabilities.

Second Tree Model - Race

# Creating a High School Subset that only contains race and graduation rate
Race_Public_High_Schools_Subset<-Public_High_Schools_Subset[,c(-1:-17,-25:-46)]

# Creating Tree Diagram for Race
set.seed(123)
graduation_race_rpart <- rpart(Graduation_80_Percent ~ ., data = Race_Public_High_Schools_Subset, method = "class", parms = list(split = "gini"))
rpart.plot(graduation_race_rpart, type=0, extra=101)

# Result: Major Variable = % Hispanic, % Asian

Looking only at the race of students to determine graduation rate, if the Hispanic population within a school is lower than 14% then the school is likely to have a graduation rate higher than 80%. Without my cluster analysis, I would have struggled to justify this seemingly concrete finding. Recall from the fifth group in that analysis that a high Hispanic population is grouped with a high exclusion rate.
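Thresholds like this one can be read off the plot, but they are also stored in the fitted object. The sketch below shows one way to pull the root split out of an rpart fit. It uses rpart's bundled `kyphosis` data as a stand-in, so the variable and threshold it returns have nothing to do with the school results.

```r
library(rpart)

# Stand-in fit on rpart's bundled kyphosis data, not the school data
split_fit <- rpart(Kyphosis ~ ., data = kyphosis, method = "class",
                   parms = list(split = "gini"))

# The first row of `frame` is the root node; `var` holds its split variable
root_variable <- as.character(split_fit$frame$var[1])

# The first row of `splits` is the root's primary split; `index` is the cut point
root_threshold <- split_fit$splits[1, "index"]

cat("Root split:", root_variable, "at", root_threshold, "\n")
```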

Third Tree Model - Gender

# Creating a High School subset that only contains classroom gender and graduation rate
Gender_Public_High_Schools_Subset<-Public_High_Schools_Subset[,c(-1:-24,-27:-46)]

# Creating Tree Diagram for Gender
set.seed(123)
graduation_gender_rpart <- rpart(Graduation_80_Percent ~ ., data = Gender_Public_High_Schools_Subset, method = "class", parms = list(split = "gini"))
rpart.plot(graduation_gender_rpart, type=0, extra=101)

# Result: Major Variable = % Males, % Females

The tree diagram by gender highlights the same concept as the cluster model: a high percentage of females leads to a high graduation rate. Part of me wonders if this is because females are smarter than males, or if females are less likely to be excluded and are more focused on graduation.

Fourth Tree Model - Student Type

# Creating a High School subset that only contains student type and graduation rate
Type_Public_High_Schools_Subset<-Public_High_Schools_Subset[,c(-1:-12,-18:-46)]

# Creating Tree Diagram for Student Type
set.seed(123)
graduation_type_rpart <- rpart(Graduation_80_Percent ~ ., data = Type_Public_High_Schools_Subset, method = "class", parms = list(split = "gini"))
rpart.plot(graduation_type_rpart, type=0, extra=101)

# Result: Major Variable = % High Needs

The types of students within schools highlight the idea that general public schools are not capable of handling students with high needs; large numbers of high needs students therefore lead to a lower graduation rate.

Fifth Tree Model - Classroom Statistics

# Creating a High School subset that only contains classroom statistics and graduation rate
Classroom_Public_High_Schools_Subset<-Public_High_Schools_Subset[,c(-1:-27,-30:-38,-40:-46)]

# Creating Tree Diagram for Classroom Statistics
set.seed(123)
graduation_classroom_rpart <- rpart(Graduation_80_Percent ~ ., data = Classroom_Public_High_Schools_Subset, method = "class", parms = list(split = "gini"))
rpart.plot(graduation_classroom_rpart, type=0, extra=101)

# Result: Major Variable = Number of Students, Average Class Size

Looking at classroom statistics, it’s apparent that school size is the largest factor in determining graduation rate, followed by average class size. This is interesting because smaller schools appear to have a lower graduation rate, and my thought is that these are schools dedicated to more specific groups of children, for example success centers for children who have repeatedly been excluded. Looking at the traditional high schools, which have more than 300 students, the tree supports the original belief that class sizes smaller than 23 lead to a high graduation rate.

Sixth Tree Model - Financials

# Creating a High School subset that only contains financials
Financials_Public_High_Schools_Subset<-Public_High_Schools_Subset[,c(-1:-29,-39:-46)]

# Creating Tree Diagram for Financials
set.seed(123)
graduation_financials_rpart <- rpart(Graduation_80_Percent ~ ., data = Financials_Public_High_Schools_Subset, method = "class", parms = list(split = "gini"))
rpart.plot(graduation_financials_rpart, type=0, extra=101)

# Result: Major Variable = Total Expenditures, Average Salary

Lastly, looking at the financials, it’s unclear what exactly is happening, since expenditures lower than a given amount lead to high graduation rates. Again, I think this may be due to more specialized schools requiring more funding, but what I have is enough to create my final decision tree.

Final Tree - Major Variables from previous Trees

# Creating a High School subset that only contains major variables
Major_Public_High_Schools_Subset<-Public_High_Schools_Subset[,c(16,20,21,25,26,28,29,31,36,47)]

# Creating Tree Diagram Considering all Major Variables
set.seed(123)
graduation_major_rpart <- rpart(Graduation_80_Percent ~ ., data = Major_Public_High_Schools_Subset, method = "class", parms = list(split = "gini"))
rpart.plot(graduation_major_rpart, type=0, extra=101)

So, taking the top two variables from each individual tree, I was able to produce a more reasonable tree. Again the percentage of high needs students is the determining factor, but in this tree you can see that there are schools with a high percentage of high needs students and high graduation rates. Furthermore, there are actionable steps that determine which direction a school takes at a node. For example, if high needs students form 60% of your population but the average salary of teachers and staff is greater than 77,000, the school has a high graduation rate. Even though percentages of high needs greater than 68% still show low graduation rates, I can suggest the middle portion of this tree as actionable steps for increasing graduation rate, since it has worked for fairly high percentages. It may just be that the data set has no cases where salary varies for high needs percentages greater than 68%, so the distinction can’t be made. This does not mean that it won’t work though! Because there are actionable steps as nodes in the tree, I am much more comfortable with this model than the first model, where I simply threw in every variable.
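I picked the top two variables from each tree by reading the plots. A related, but not identical, approach is to rank variables by the importance scores rpart stores in the fitted object. The sketch below shows that alternative on rpart's bundled `kyphosis` data. It is a different technique from the visual reading used above and may not select the same variables.

```r
library(rpart)

# Stand-in fit on rpart's bundled kyphosis data, not the school data
importance_fit <- rpart(Kyphosis ~ ., data = kyphosis, method = "class",
                        parms = list(split = "gini"))

# variable.importance is a named numeric vector; larger means more influential
importance <- sort(importance_fit$variable.importance, decreasing = TRUE)
top_two <- names(importance)[1:2]

importance
top_two
```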

Model   Correct Classification   Incorrect Classification   Accuracy
1       339                      26                         92.87%
2       338                      27                         92.60%

Looking at the statistics of each tree, they are very similar. Both trees assign far more schools correctly than incorrectly, which gives a high accuracy rate of roughly 93%. Keep in mind that, with no separate test set, these are accuracies on the same schools used to build the trees. However, for the reasons I just mentioned, the second tree, which required my own judgment in choosing variables, is more actionable and therefore more meaningful.
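For reference, counts like these can be produced by comparing a tree's fitted classes with the actual labels. The sketch below is one way to do that, shown on rpart's bundled `kyphosis` data rather than the school data, so its numbers are unrelated to the figures reported here.

```r
library(rpart)

# Stand-in fit on rpart's bundled kyphosis data, not the school data
acc_fit <- rpart(Kyphosis ~ ., data = kyphosis, method = "class",
                 parms = list(split = "gini"))

# Predicted class for every training record (in-sample, no test set)
predicted <- predict(acc_fit, type = "class")

# Confusion matrix: rows are actual classes, columns are predicted classes
confusion <- table(actual = kyphosis$Kyphosis, predicted = predicted)

correct <- sum(diag(confusion))
incorrect <- sum(confusion) - correct
accuracy <- correct / (correct + incorrect)

confusion
sprintf("Correct: %d, Incorrect: %d, Accuracy: %.2f%%",
        correct, incorrect, 100 * accuracy)
```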


Major Findings

Major findings from this project are as follows. The factors determining whether a school has a high graduation rate are, in order:

Percentage of High Needs Students

The percentage of students who are high needs. To me, this implies that general public schools are not equipped to handle students whose issues go beyond the classroom. Therefore they struggle to motivate and engage students to focus on graduating.

Average Teacher Salary

Secondly, average teacher salary. Challenging districts often have the lowest salaries and, in turn, the lowest requirements to become a teacher, which can result in unqualified and/or apathetic teachers. I can attest to this first hand: I had a bachelor’s degree with no experience and I was thrown into a classroom with nothing but my general intuition.

Class and School Size

Lastly, class and school size. Smaller class sizes allow for more personalized and goal-oriented learning through truly understanding each student’s learning methods, as opposed to just teaching to the masses. Far too often a teacher is overwhelmed with classroom management when class sizes are large, which prevents them from delivering the material in an inspiring manner. Think about individuals selling handmade gifts. Everyone wants a personal touch on their product, but the larger the quantity, the less attention each product receives.


Limitations

As always, I would be lying if I said that there weren’t limitations to these findings.

Improvements

With limitations there are always improvements so to combat all of the limitations I just mentioned, here they are


Recommendations

With all that being said and done, here are my final recommendations directed towards principals and district officials.

First Recommendation

For schools with a high percentage of high needs students, hire more qualified teachers and staff. I was one of these problem teachers. I was never taught how to handle students with larger issues outside of the classroom such as gang violence, drugs, broken homes etc. Having teachers that can work through problems with students can help to re-direct their attitudes towards graduation.

Second Recommendation

Raise the expectations of new teachers. This goes back to teachers being in such high demand that literally anyone can become one, which devalues the education system and harms every student who is a part of it. High quality teachers will directly impact graduation rate.

Third Recommendation

Lastly, manage class sizes so that a teacher does not have to teach 30 students in a generalized manner and can instead teach 20 students through an interesting and engaging activity. This can be done by balancing the size of schools and classes across the district. And now for my conclusion to this project.


Conclusion

First Reflection

It is clear both from this project and my own experience that every high school is somewhat unique, which means that it can be incredibly difficult to generalize findings and improvements. What might work for one high school may not work for another. But, having those differences evident in the data allows for the chance to account for that uniqueness.

Second Reflection

I personally believe that more emphasis needs to be placed on how high schools are performing. In this data set there were schools with a graduation rate of 8.7%. Regardless of the specifics, is this acceptable? Because I feel as though rates that are this low mean that the school system is not providing the right tools for students to succeed.

Third Reflection

Lastly, as I have found, there is no simple answer. There is no recipe. With so many variables potentially impacting graduation rates, this problem is much harder to address and solve than simply applying models to find the common themes for higher graduation rates. Instead this process should be done over and over with recommendations being implemented each time to find the potentially unique solution for each school. Only continuous analysis can account for the countless changes in school dynamics.


Powerpoint Slides

Title Slide

Proposal Slide

Data Source Slide

Cleaning the Data Slide

Exploratory Data Analysis Slide

Cluster Model Slide

Varying Tree Models Slide

Final Tree Models Slide

Major Findings Slide

Recommendations & Conclusion Slide